Using a within-subjects design, we administered to 83 naval pilots and flight officers computer-based
and paper-based tests to assess recognition of aircraft silhouettes in order to determine
the relative reliabilities and validities of these two measurement modes. Estimates of internal
consistencies, equivalences, and discriminative validities were computed for multiple performance
measures. It was established that the relative reliabilities and validities derived for these two
assessment schemes were contingent on the employed multivariate measurement criteria, that
is, percentage correct responses, average response latency, and average degree of confidence in
recognition judgments, as well as the statistical criteria used to ascertain the comparative quality
of these two modes of testing.


The  consequences  of  computer-based  assessment  on  ex¬ 
aminees’  performance  are  not  obvious.  The  investigations 
that  have  been  conducted  on  this  topic  have  produced 
mixed  results.  Some  studies  (D.  F.  Johnson  &  Mihal, 
1973;  Serwer  &  Stolurow,  1970)  demonstrated  that  test 
takers  do  better  on  verbal  items  given  by  computer-based 
tests  than  they  do  on  paper-based  tests;  however,  just  the 
opposite  was  found  by  other  studies  (D.  F.  Johnson  & 
Mihal,  1973;  Wildgrube,  1982).  One  investigation  (Sachar 
& Fletcher, 1978) yielded no significant differences resulting
from computer-based and paper-based modes of administration
on verbal items. Two studies (English, Reckase,
& Patience, 1977; Hoffman & Lundberg, 1976)
demonstrated  that  these  two  testing  modes  did  not  affect 
performance  on  memory  retrieval  items.  Sometimes 
(D.  F.  Johnson  &  Mihal,  1973)  test  takers  do  better  on 
quantitative  tests  when  they  are  computer  given,  some¬ 
times  (Lee,  Moreno,  &  Sympson,  1984)  they  do  worse, 
and  other  times  (Wildgrube,  1982)  it  may  make  no  differ¬ 
ence.  Other  studies  have  supported  the  equivalence  of 
computer-based  and  paper-based  administration  (Elwood 
& Griffin, 1972; Hedl, O’Neil, & Hansen, 1973; Kantor,
1988; Lukin, Dowd, Plake, & Kraft, 1985). Some
researchers  (Evan  &  Miller,  1969;  Koson,  Kitchen, 
Kochen,  &  Stodolosky,  1970;  Lucas,  Mullin,  Luna,  & 
McInroy, 1977; Lukin et al., 1985; Skinner & Allen,
1983)  have  reported  psychometric  capabilities  of 
computer-based  assessment  to  be  comparable  or  superior 
to  paper-based  assessment  in  clinical  settings. 
Investigations of computer-based presentation of personality
items have yielded reliability and validity indices
comparable to those obtained with typical paper-based
presentation (Katz & Dalby, 1981; Lushene, O’Neil, &
Dunn, 1974). No significant differences were found in
the scores of measures of anxiety, depression, and psychological
reactance due to computer-based and paper-based
administration (Lukin et al., 1985). Studies of cognitive
tests have provided inconsistent findings, with some
studies (Hitti, Riffer, & Stuckless, 1971; Rock & Nolen,
1982) demonstrating that the computerized version is a
viable alternative to the paper-based version. Other
research (Hansen & O’Neil, 1970; Hedl et al., 1973;
D. F. Johnson & White, 1980; J. H. Johnson & K. N.
Johnson, 1981), though, indicated that interacting with
a computer-based system to take an intelligence test could
elicit a considerable amount of anxiety, which could affect
performance.

Regarding  computerized  adaptive  testing  (CAT),  some 
empirical  comparisons  (McBride,  1980;  Sympson,  Weiss, 
&  Ree,  1982)  yielded  essentially  no  change  in  validity 
due  to  mode  of  administration.  However,  test-item 
difficulty may not be indifferent to manner of presentation
for CAT (Green, Bock, Humphreys, Linn, & Reckase,
1984). The effect of switching from paper-based to
computer-based administration is thought to have three
aspects: (1) an overall mean shift, in which all items may
be  easier  or  harder;  (2)  an  item-mode  interaction,  in 
which  a  few  items  may  be  altered  and  others  not;  and 
(3)  the  nature  of  the  task  itself,  which  may  be  changed 
by  computer  administration.  A  computer  simulation  study 
(Divgi,  1988)  demonstrated  that  a  CAT  version  of  the 
Armed  Services  Vocational  Aptitude  Battery  had  a  higher 
reliability  than  a  paper-based  version  for  these  subtests: 
general  science,  arithmetic  reasoning,  word  knowledge, 
paragraph comprehension, and mathematics knowledge.
The inconsistent results of mode, manner, or medium of
testing may be due to differences in methodology, test content,
population tested, or the design of the study (Lee
et al., 1984).

With  computer  costs  decreasing  and  people's  knowledge 
of  these  systems  increasing,  it  becomes  more  likely  eco¬ 
nomically  and  technologically  that  many  benefits  can  be 
gained  from  their  use.  A  direct  advantage  of  computer- 
based  testing  is  that  individuals  can  respond  to  items  at 
their  own  pace,  thus  producing  ideal  power  tests.  Some 
indirect  advantages  of  computer-based  assessment  are  in¬ 
creased  test  security,  less  ambiguity  about  students’ 
responses,  minimal  or  no  paperwork,  immediate  scoring, 
and  automatic  recordkeeping  for  item  analysis  (Green, 
1983a,  1983b).  Some  of  the  strongest  support  for  computer- 
based  assessment  is  based  on  the  awareness  of  faster  and 
more economical measurement (Elwood & Griffin, 1972;
D. F. Johnson & White, 1980; Space, 1981). Cory (1977)
reported  some  advantages  of  computerized  testing  over 
paper-based  testing  for  predicting  job  performance. 

Ward  (1984)  stated  that  computers  can  be  employed  to 
augment  what  is  possible  with  paper-based  measurement 
(e.g., to obtain more precise information regarding a student
than is likely with more customary measurement
methods)  and  to  assess  additional  aspects  of  performance. 
He  enumerated  and  discussed  potential  benefits  that  may 
be  derived  by  using  computer-based  systems  to  administer 
traditional  tests.  Some  of  these  are  as  follows:  (1)  individ¬ 
ualizing  assessment,  (2)  increasing  the  flexibility  and  effi¬ 
ciency  of  managing  test  information,  (3)  enhancing  the  eco¬ 
nomic  value  and  manipulation  of  measurement  databases, 
and  (4)  improving  diagnostic  testing.  Millman  (1984) 
agreed  with  Ward  that  computer-based  measurement  en¬ 
courages  individualized  assessment  and  that  designing 
software within the context of cognitive science is important.
Also, what limits computer-based assessment is not so
much hardware inadequacy as incomplete comprehension
of the processes intrinsic to testing (Federico, 1980).

Many  benefits  may  be  obtained  from  computerized  test¬ 
ing.  Some  of  these  may  be  related  to  attitudes  and  assump¬ 
tions  associated  with  the  use  of  novel  media  or  innova¬ 
tive  technology  per  se.  However,  and  just  as  readily, 
potential  problems  may  result  from  the  use  of  computer- 
based  measurement.  Differences  between  this  mode  of  as¬ 
sessment  and  traditional  testing  techniques  may  or  may 
not  affect  the  reliability  and  validity  of  measurement. 
Notably absent from this literature are studies that have
compared  the  testing  characteristics  of  computer-based  as¬ 
sessment  with  customary  measurement  methods  for  as¬ 
sessing  recognition  performance. 

One  discernible  difference  between  employing  a 
computer-based  method  versus  a  paper-based  method  for 
recognition  testing  of  shapes,  silhouettes,  or  spatial  forms 
is  the  degree  of  control  over  stimulus  presentation,  ex¬ 
posure,  or  duration.  Complete  control  of  stimulus  ex¬ 
posure  is  possible  when  using  a  computer-based  method, 
whereas strict manipulation of stimulus presentation is
practically impossible or intrinsically lacking when using
a paper-based method. One might expect or hypothesize
that the longer the stimulus presentation or viewing
time  of  a  test  item  available  to  the  subject  during  paper- 
based  recognition  assessment,  the  more  reliable  and  valid 
the  measurement  compared  with  computer-based  recog¬ 
nition  assessment.  This  assumes  that  the  exposure  of  each 
shape  or  silhouette  item  during  the  computer-based  test 
is  approaching  tachistoscopic  durations.  If  brief  exposures 
of  figural  forms  are  employed  during  computer-based 
recognition  testing,  subjects  may  not  have  sufficient  search 
time  to  detect  or  identify  distinctive  characteristics,  fea¬ 
tures,  or  attributes,  which  are  necessary  for  correct  recog¬ 
nition.  A  salient  research  issue  that  should  be  addressed 
is  the  specification  of  some  of  the  important  psychomet¬ 
ric  implications  of  employing  computer-based  versus 
paper-based  procedures  for  measuring  recognition  per¬ 
formance.  The  primary  purpose  of  this  research  was  to 
evaluate  empirically  the  relative  reliability  and  validity 
of  computer-based  and  paper-based  procedures  for  assess¬ 
ing  recognition  of  aircraft  silhouettes. 

METHOD 

Subjects 

The  subjects,  who  volunteered  to  participate  in  this 
study,  were  83  male  student  pilots  and  radar  intercept 
officers from the Fleet Replacement Squadron, VF-124,
Naval  Air  Station  Miramar.  These  students  must  learn  to 
recognize or identify Soviet and non-Soviet aircraft silhouettes
so that they can properly employ the F-14 fighter.

Subject  Matter 

The  subject  matter  consisted  of  line  drawings  of  the 
front,  side,  and  top  silhouettes  of  Soviet  and  non-Soviet 
aircraft.  A  paper-based  study  guide  was  designed  and  de¬ 
veloped  for  the  subjects  to  help  them  learn  to  recognize 
the  silhouettes  of  4  Soviet  naval  air  bombers  and  10  of 
their  front-line  fighters.  Silhouettes  of  non-Soviet  aircraft 
were  also  presented,  since  these  could  be  mistaken  for 
Soviet  threats  or  vice  versa. 

The  subjects  were  asked  to  study  each  Soviet  silhouette 
and  its  corresponding  non-Soviet  silhouette  and  note  the 
distinctive  features  of  each.  The  correct  identification  of 
each  Soviet  and  non-Soviet  silhouette  according  to  NATO 
name and alphanumeric designator (e.g., Foxhound or
MiG-31) appeared directly below it. The subjects were
told  that  in  the  near  future,  their  recognition  of  these 
Soviet and non-Soviet aircraft would be assessed via computer
and traditional testing.

In  addition  to  using  the  paper-based  study  guide,  the 
subjects were required to learn the silhouettes via the
computer-based  system  described  below,  which  was  con¬ 
figured  in  a  training  mode  for  this  purpose.  In  this  mode, 
when a student pressed a key, a silhouette would reappear
together with its correct identification so that they
could  be  associated. 

Computer-Based  Assessment 

Computerized  line  drawings  were  used  to  assess  how 
well the subjects recognized or identified the silhouettes.
These were digitized facsimiles of those employed in the
paper-based  study  guide.  A  computer  game  based  on  a 
sequential  recognition  paradigm  was  developed  (Little, 
Maffly,  Miller,  Setter,  &  Federico,  1985).  It  randomly 
selected  and  presented  on  a  computer  display,  at  an  ar¬ 
bitrary  exposure  setting,  the  front,  side,  or  top  views  of 
4 Russian bombers and 10 of their advanced fighters. For
this  research,  the  exposure  of  a  silhouette  on  the  com¬ 
puter  screen  was  approximately  500  msec.  Also,  the  game 
management  system  can  choose  and  flash  corresponding 
silhouettes  of  NATO  aircraft,  which  act  as  distractors  be¬ 
cause  of  their  high  degree  of  similarity  to  the  Soviet  sil¬ 
houettes.  The  subjects’  task  was  to  identify  as  quickly  as 
possible  the  aircraft  that  was  represented  by  each  sil¬ 
houette.  The  subjects  entered  on  the  keyboard  what  they 
recognized  each  aircraft  to  be,  using  its  NATO  name  or 
corresponding  alphanumeric  designation.  Misspellings 
counted  as  wrong  responses. 

This  particular  computer-based  game  or  test  assesses 
student performance by measuring the “hit rate,” or number
of correct recognitions, out of a total of 42 silhouettes,
half  of  which  are  Soviet  and  the  other  half  non-Soviet; 
the  time,  or  latency,  it  takes  a  student  to  make  a  recogni¬ 
tion  judgment  for  each  target  or  distractor  aircraft;  and 
the  degree  of  confidence  the  student  has  in  each  of  his 
recognition  decisions.  At  the  end  of  the  game,  feedback 
is  given  to  the  student  in  terms  of  his  hit  rate  (computer- 
based  total  percentage  correct  responses,  or  CTP),  aver¬ 
age  response  latency  (computer-based  total  average 
response  latency,  or  CTL),  average  degree  of  confidence 
in  his  recognition  judgments  (computer-based  total  aver¬ 
age  degree  of  confidence,  or  CTC),  and  how  his  perfor¬ 
mance  compares  to  other  students  who  have  played  the 
game. 
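
The three computer-based measures can be illustrated with a short scoring sketch. This is a minimal illustration, not the game's actual code: the `Trial` record, its field names, the sample responses, and the case-insensitive comparison are assumptions introduced here, while exact-match scoring reflects the rule that misspellings counted as wrong responses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    accepted: tuple      # acceptable answers: NATO name and designator (hypothetical encoding)
    response: str        # what the subject typed
    latency_ms: float    # time taken to respond, in milliseconds
    confidence: float    # self-rated confidence, 0 to 100


def score_session(trials):
    """Return percentage correct (CTP), mean latency (CTL), and mean confidence (CTC)."""
    hits = 0
    for t in trials:
        answer = t.response.strip().lower()
        # Exact match only, so a misspelling is scored as a wrong response.
        if answer in {a.lower() for a in t.accepted}:
            hits += 1
    return {
        "CTP": 100.0 * hits / len(trials),
        "CTL": mean(t.latency_ms for t in trials),
        "CTC": mean(t.confidence for t in trials),
    }


# Two made-up trials: one correct, one misspelled.
trials = [
    Trial(("Foxhound", "MiG-31"), "foxhound", 1200.0, 90.0),
    Trial(("Foxhound", "MiG-31"), "Foxhond", 2000.0, 50.0),
]
print(score_session(trials))
```

The paper-based measures PTP and PTC could be computed the same way, omitting latency, which was not recorded for that mode.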

Paper-Based  Assessment 

Since  the  computer-based  test  randomly  selected  and 
presented  silhouettes,  creating  a  distinct  sequence  of  test 
items  or  form  for  each  subject,  an  attempt  was  made  to 
simulate  this  computer-based  administration  of  different 
forms  by  employing  different  paper-based  forms.  Con¬ 
sequently,  two  alternative  forms  of  a  paper-based  test  were 
designed  and  developed  to  assess  the  subjects'  recogni¬ 
tion  of  the  silhouettes  mentioned  above.  The  alternative 
test  forms  mimicked  as  much  as  possible  the  format  used 
by  the  computer-based  test.  Also,  these  paper-based  al¬ 
ternative  forms  employed  as  individual  items  facsimiles 
of  the  digitized  silhouettes  used  in  their  computer  coun¬ 
terparts.  Both  test  forms  were  presented  as  booklets,  each 
containing  42  items  representing  the  front,  top,  or  side 
silhouettes  of  aircraft.  The  subjects’  task  was  to  identify 
as  quickly  as  possible  the  aircraft  that  was  represented 
by each item’s silhouette. They were asked to write in
the  space  provided  what  they  recognized  the  aircraft  to 
be,  using  its  NATO  name  or  corresponding  alphanumeric 
designation.  Misspellings  counted  as  wrong  responses. 
The  subjects  were  instructed  not  to  turn  back  to  previous 
pages  in  the  test  booklet  to  complete  items  they  had  left 
blank. The students were encouraged to go through the
test items quickly to approximate as much as possible the
silhouette  exposure  employed  by  the  computer-based  test. 
The  subjects  were  closely  monitored  to  assure  that  they 
complied  with  this  procedure. 

After  the  subjects  wrote  down  what  they  thought  an  air¬ 
craft  was,  they  were  required  to  indicate  on  a  scale  that 
appeared  below  each  silhouette  the  degree  of  confidence 
in  their  recognition  decision  concerning  the  specific  item. 
Like  the  confidence  scale  used  for  the  computer-based  test, 
this  one  went  from  least  confident,  or  0%  confidence,  in 
their  recognition  decision  on  the  left,  to  most  confident, 
or  100%  confidence,  on  the  right,  using  a  10-point  scale. 
The  subjects  were  instructed  to  use  this  confidence  scale 
by  placing  a  check  mark  at  the  point  that  best  reflected 
or  approximated  the  sureness  they  had  in  their  judgment. 
To learn how to respond properly to the silhouette test
items,  the  subjects  were  asked  to  look  at  three  completed 
examples.  A  subject’s  percentage  of  correct  recognitions 
(paper-based  total  percentage  correct  responses,  or  PTP) 
and  average  degree  of  confidence  (paper-based  total  aver¬ 
age  degree  of  confidence,  or  PTC)  for  the  paper-based 
test  were  measured  and  recorded. 

Procedure 

Prior  to  testing,  the  subjects  learned  to  recognize  the 
aircraft  silhouettes  using  two  media:  (1)  a  paper-based 
form  structured  as  a  study  guide,  and  (2)  a  computer- 
based  form  using  the  system  in  the  training  mode,  as  men¬ 
tioned  above.  Mode  of  assessment,  computer-based  or 
paper-based,  was  manipulated  as  a  within-subjects  varia¬ 
ble  (Kirk,  1968).  All  subjects  were  administered  the 
paper-based  test  before  the  computer-based  test.  The  two 
forms  of  the  paper-based  tests  were  counterbalanced  or 
alternated  in  their  administration  to  the  subjects.  After  the 
subjects  received  the  paper-based  test,  they  were  immedi¬ 
ately  administered  the  computer-based  test.  It  was  as¬ 
sumed  that  a  subject’s  state  of  recognition  knowledge  was 
the  same  during  the  administration  of  both  tests.  The  sub¬ 
jects  took  approximately  10-15  min  to  complete  the 
paper-based  test  and  15-20  min  to  complete  the  computer- 
based  test.  This  difference  in  completion  time  was  primar¬ 
ily  due  to  lack  of  typing  proficiency  among  some  of  the 
subjects. 

Reliabilities  for  both  modes  of  testing  were  estimated 
by  deriving  internal  consistency  indices  using  an  odd-even 
item  split.  These  reliability  estimates  were  adjusted  by 
employing the Spearman-Brown Prophecy Formula
(Thorndike,  1982).  Reliability  estimates  were  calculated 
for test score, average degree of confidence, and average
response latency for the computer-based test; reliability
estimates were calculated only for test score and
average degree of confidence for the paper-based test. Estimates
were not computed for average response latency
since this was not measured for the paper-based test.
Equivalences between these two modes of assessment
were estimated by Pearson product-moment correlations
for total test score and average degree of confidence.
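
The internal-consistency procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the subjects-by-items score matrix and all function names are hypothetical, and the odd and even halves are formed from item positions in each subject's own sequence, which the text does not specify.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per subject, one column per item."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Made-up right/wrong scores for three subjects on four items.
scores = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
print(split_half_reliability(scores))
```

The equivalence estimate is then simply `pearson_r` applied to each subject's computer-based and paper-based totals for the same measure.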

To  derive  discriminative  validity  estimates,  we  placed 
the  research  subjects  into  two  groups  according  to  whether 
or not their performance through the squadron's curriculum
was above or below the mean grade for this sample.
A  stepwise  discriminant  analysis  (Dillon  &  Goldstein, 
1984),  using  Wilks’s  criterion  for  including  and  reject¬ 
ing  variables,  and  associated  statistics  were  computed  to 
ascertain  how  well  computer-based  and  paper-based  mea¬ 
sures  distinguished  among  the  defined  groups  expected 
to  differ  in  their  recognition  of  aircraft  silhouettes. 

RESULTS 

Reliability  and  Equivalence  Estimates 

The  means,  standard  deviations,  intercorrelations,  and 
associated  statistics  for  the  computer-based  and  paper- 
based  measures  of  recognition  performance  are  presented 
in Table 1. As can be seen, the subjects’ recognition performance
was significantly better, and they had significantly
more confidence, on the paper-based test than on
the  computer-based  test.  On  both  the  computer-based  and 
paper-based  versions,  the  subjects’  recognition  perfor¬ 
mance  correlated  significantly  and  positively  with  confi¬ 
dence  in  their  identification  judgments.  For  the  computer- 
based  mode,  the  subjects’  response  latency  varied  in¬ 
versely  and  significantly  with  their  recognition  perfor¬ 
mance  and  confidence.  It  appears  that  recognition  per¬ 
formance  and  confidence  were  more  strongly  associated 
for  the  paper-based  test  than  for  its  computer-based  coun¬ 
terpart. 
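
The t statistics in the note to Table 1 are consistent with a paired comparison computed from the means, standard deviations, and cross-mode correlation. The sketch below assumes a paired-samples t test (the text does not name the test) and uses rounded table values, including a paper-based SD of 17.05 read from a damaged table entry, so it only approximates the reported t(82) = -3.77.

```python
from math import sqrt


def paired_t(m1, sd1, m2, sd2, r, n):
    """Paired-samples t from summary statistics; df = n - 1."""
    # SD of the difference scores, from the two SDs and their correlation.
    sd_diff = sqrt(sd1 ** 2 + sd2 ** 2 - 2.0 * r * sd1 * sd2)
    return (m1 - m2) / (sd_diff / sqrt(n))


# CTP versus PTP, using rounded values from Table 1 (n = 83 subjects).
t_score = paired_t(77.39, 20.59, 83.81, 17.05, 0.67, 83)
print(round(t_score, 2))
```

The negative sign indicates the lower computer-based mean; small departures from the published value are expected because the inputs are rounded to two decimals.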

Split-half  reliability  and  equivalence  estimates  of 
computer-based  and  paper-based  measures  of  recognition 
performance  are  presented  in  Table  2.  The  adjusted  relia¬ 
bility  estimates  are  relatively  high,  ranging  from  .89  to 
.97.  The  difference  in  reliabilities  for  computer-based  and 
paper-based measures for average degree of confidence
was  statistically  significant  (p  <  .02),  using  a  test 
described  by  Edwards  (1964,  p.  85).  However,  the  differ¬ 
ence  in  reliabilities  for  computer-based  and  paper-based 
measures  of  the  recognition  test  score  was  not  significant. 
These results revealed that (1) the computer-based and
paper-based measures of test score were not significantly
different  in  reliability  or  internal  consistency,  and  (2)  the 
paper-based measure of average degree of confidence was
more reliable or internally consistent than the computer-based
measure.

Table 1

Means, Standard Deviations, and Intercorrelations
of Computer-Based and Paper-Based Measures
of Recognition Performance

Measure   M           SD            CTP     CTC     CTL     PTP
CTP       77.39†      20.59
CTC       89.93‡      12.89         .57
CTL       1,779.51    1,493         -.22*   -.45
PTP       83.81†      17.05         .67     .72     -.33
PTC       92.14‡      [illegible]   .48     .81     -.41    [illegible]

Note. CTP = computer-based total percentage correct responses; CTL
= computer-based total average response latency; CTC = computer-based
total average degree of confidence; PTP = paper-based total percentage
correct responses; PTC = paper-based total average degree of
confidence. CTL was measured in milliseconds. r(81) > .27,
p < .005. *r(81) > .21, p < .025. †t(82) = -3.77, p < .001.
‡t(82) = -2.70, p = .008.


Table 2

Split-Half Reliability and Equivalence Estimates of Computer-Based
and Paper-Based Measures of Recognition Performance

                      Reliability
Measure       Computer-Based   Paper-Based   Equivalence
Score         .90              .89           .67
Confidence    .95              .97           .81
Latency       .93

Note. Split-half reliability estimates were adjusted by using the
Spearman-Brown Prophecy Formula.

Estimates  of  equivalence  between  corresponding 
computer-based  and  paper-based  measures  of  recognition 
test  score  and  average  degree  of  confidence  were  .67  and 
.81,  respectively.  These  suggested  that  the  computer- 
based  and  paper-based  measures  had  from  45%  to  66% 
variance  in  common,  implying  that  these  different  modes 
of  assessment  were  only  partially  equivalent.  The  equiva¬ 
lences  for  test  score  and  average  degree  of  confidence 
measures  were  significantly  different  (p  <  .001).  This 
result  suggested  that  computer-based  and  paper-based 
measures  of  average  degree  of  confidence  were  more 
equivalent  than  the  measures  of  recognition  test  score 

Discriminative  Validity 

The  discriminant  analysis  was  computed  to  determine 
how  well  computer-based  and  paper-based  measures  of 
recognition  performance  differentiated  groups  defined  by 
above  or  below  mean  average  curriculum  grade.  The 
statistics  associated  with  the  single  significant  function, 
standardized  discriminant-function  coefficients,  pooled 
within-groups  correlations  between  the  discriminant  func¬ 
tion  and  computer-based  and  paper-based  measures,  and 
group  centroids  for  above  or  below  mean  average  cur¬ 
riculum  grade  are  presented  in  Table  3.  The  discriminant- 
function  coefficients,  which  consider  the  interrelationships 
or  interdependencies  among  the  multivariate  measures, 
revealed  the  relative  contribution  or  comparative  impor¬ 
tance  of  the  variables  in  defining  this  derived  dimension 
to  be  CTC,  PTC,  PTP,  CTP,  and  CTL,  respectively.  The 
within-groups  correlations,  which  were  computed  for  each 
individual measure partialling out the interrelationships
of all the other variables, indicated that the major contributors
to the discriminant function were CTP, CTC, and
CTL,  respectively,  all  computer-based  measures. 
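
For a single discriminant function, the eigenvalue, canonical correlation, and Wilks's lambda are tied together by standard identities, and the chi-square is commonly obtained from Bartlett's approximation. The sketch below assumes that approximation was used (the text does not say) and starts from the rounded eigenvalue of .14 reported in Table 3, so its chi-square of about 10.3 only approximates the reported 9.98.

```python
from math import log, sqrt


def single_function_stats(eigenvalue, n_subjects, n_variables, n_groups=2):
    """Statistics for one discriminant function (the two-group case)."""
    canonical_r = sqrt(eigenvalue / (1.0 + eigenvalue))
    wilks_lambda = 1.0 / (1.0 + eigenvalue)
    # Bartlett's chi-square approximation for testing Wilks's lambda.
    multiplier = n_subjects - 1 - (n_variables + n_groups) / 2.0
    chi_square = -multiplier * log(wilks_lambda)
    df = n_variables * (n_groups - 1)
    return canonical_r, wilks_lambda, chi_square, df


# 83 subjects, 5 measures (CTP, CTC, CTL, PTP, PTC), 2 groups.
canonical_r, wilks_lambda, chi_square, df = single_function_stats(0.14, 83, 5)
print(round(canonical_r, 2), round(wilks_lambda, 2), round(chi_square, 2), df)
```

The canonical correlation and lambda round to the reported .35 and .88; the gap in the chi-square is of a size that rounding of the eigenvalue could explain.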

The  means  and  standard  deviations  for  groups  above 
or  below  mean  average  curriculum  grade,  univariate  F 
ratios,  and  levels  of  significance  for  computer-based  and 
paper-based  measures  of  recognition  performance  are 
summarized in Table 4. Considering the measures as
univariate  variables  (i.e.,  independent  of  their  multivari¬ 
ate  relationships  or  dependencies  with  one  another),  these 
statistics  revealed  that  one  computer-based  measure,  CTL, 
and  one  paper-based  measure,  PTC,  significantly  differen¬ 
tiated  the  two  groups.  The  means  revealed  that  the  group 
above  mean  curriculum  grade  had  shorter  computer-based 
latencies than the group below mean curriculum grade,
and that the former group had a higher paper-based average
degree of confidence than the latter group.

Table 3

Statistics Associated with the Significant Discriminant Function,
Standardized Discriminant-Function Coefficients, Pooled
Within-Groups Correlations Between the Discriminant
Function and Computer-Based and Paper-Based
Measures, and Group Centroids for Above or
Below Mean Curriculum Grade

Discriminant Function
Eigenvalue   Canonical Correlation   Wilks's Lambda   Chi-Square   df   p
.14          .35                     .88              9.98         5    .076

Measure   Discriminant Coefficient   Within-Group Correlation
CTP       -.60                       .60
CTC       -.97                       -.55
CTL       -.52                       .48
PTP       .80                        .25
PTC       .94                        -.03

Group               Centroid
Above Mean Grade    .32
Below Mean Grade    -.42


Table 4

Means and Standard Deviations for Groups Above or Below Mean
Grade, Univariate F Ratios, and Levels of Significance
for Computer-Based and Paper-Based Measures

                Above Mean Grade   Below Mean Grade
Measure         (n = 47)           (n = 36)           F      p
CTP    M        77.19              77.64              .01    .92
       SD       18.48              23.33
CTC    M        90.99              88.54              .73    .39
       SD       12.74              13.06
CTL    M        1,522.06           2,115.61           3.31   .07
       SD       1,554.12           1,359.19
PTP    M        86.40              80.42              2.56   .11
       SD       16.65              17.19
PTC    M        94.09              89.61              3.92   .05
       SD       9.47               11.09
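
Each univariate F ratio in Table 4 can be approximated from the group means and standard deviations alone. The sketch below is an illustration, not the authors' computation: it assumes a one-way analysis of variance with two groups and takes the above-mean group size to be 47 (83 subjects minus the 36 in the below-mean group).

```python
def two_group_f(m1, sd1, n1, m2, sd2, n2):
    """One-way ANOVA F ratio for two groups, from summary statistics."""
    grand_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
    # Between-groups sum of squares; with two groups it has 1 degree of freedom.
    ss_between = n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2
    # Within-groups sum of squares, rebuilt from the sample SDs.
    ss_within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    df_within = n1 + n2 - 2
    return ss_between / (ss_within / df_within)


# PTC: paper-based average degree of confidence, values from Table 4.
f_ptc = two_group_f(94.09, 9.47, 47, 89.61, 11.09, 36)
print(round(f_ptc, 2))
```

With the PTC values this returns about 3.93, close to the reported 3.92; the same call with the CTL values gives about 3.31.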

The  discriminant-function  coefficients  and  group  means 
also  implied  that  the  students  with  above  mean  average 
grades (1) did relatively well on the paper-based test
(PTP)  and  relatively  poorly  on  the  computer-based  test 
(CTP),  and  (2)  had  more  confidence  in  their  paper-based 
performance (PTC) than their computer-based performance
(CTC). These statistics together with the
discriminant-function  coefficients  and  group  means 
reported  for  CTL  and  CTP  as  well  as  the  correlations  be¬ 
tween  CTL  and  CTP  and  CTC  suggested  that  there  may 
have been a slight speed-accuracy tradeoff for computer-based
recognition testing.

In  general,  the  multivariate  and  subsequent  univariate 
results  established  that  according  to  two  sets  of  criteria, 
the  discriminant  coefficients  and  F  ratios  and  correspond¬ 
ing  means,  the  discriminant  validities  of  computer-based 
and paper-based measures were about the same for distinguishing
groups above or below mean average curriculum
grade. However, according to another set of criteria,
the pooled within-groups correlations between the
discriminant  function  and  the  computer-based  and  paper- 
based  measures,  the  former  had  superior  discriminative 
validity  to  the  latter. 

DISCUSSION 

This  study  established  that  the  relative  reliability  of 
computer-based  and  paper-based  measures  depends  on  the 
specific criterion assessed. That is, regarding the recognition
test score itself, it was found that computer-based
and paper-based measures were not significantly different
in reliability or internal consistency. However, regarding
the average degree of confidence in recognition judgments,
it was found that the paper-based measure was
more  reliable  or  internally  consistent  than  its  computer- 
based  counterpart.  The  extent  of  the  equivalence  between 
these  two  modes  of  measurement  was  contingent  on  par¬ 
ticular  performance  criteria.  It  was  demonstrated  that  the 
equivalence  of  computer-based  and  paper-based  measures 
of  average  degree  of  confidence  was  greater  than  that  for 
recognition  test  score.  The  relative  discriminative  valid¬ 
ity  of  computer-based  and  paper-based  measures  was  de¬ 
pendent  on  the  specific  statistical  criteria  selected.  The 
discriminant  coefficients,  F  ratios,  and  corresponding 
means  indicated  that  the  validities  of  computer-based  and 
paper-based  measures  were  about  the  same  for  distinguish¬ 
ing  groups  above  or  below  mean  curriculum  grade. 
However,  according  to  another  set  of  criteria,  the  pooled 
within-groups  correlations  between  the  discriminant  func¬ 
tion  and  computer-based  and  paper-based  measures,  the 
former  had  better  validity  than  the  latter. 

Even  though  the  subjects  had  more  time  to  view  each 
silhouette  during  the  paper-based  test  than  during  the 
computer-based  test,  recognition  scores  for  the  two  mea¬ 
surement  modes  were  not  significantly  different  in  relia¬ 
bility.  However,  the  longer  exposures  of  paper-based  as¬ 
sessment  seemed  to  have  improved  the  reliability  of 
measuring  the  subjects’  degree  of  confidence  in  their 
recognition  judgments  compared  with  computer-based  as¬ 
sessment.  As  hypothesized,  the  longer  exposures  intrin¬ 
sic  to  the  paper-based  method  seemed  to  have  facilitated 
the  subjects’  recognition  scores.  They  performed  signifi¬ 
cantly  better  on  the  paper-based  test  than  on  the  computer- 
based  test.  Also,  it  appears  that  the  difference  in  the  sil¬ 
houette  exposures  of  the  two  testing  methods  had  greater 
impact  on  the  equivalence  of  recognition  test  score  than 
on the average degree of confidence. This seemed so,
since the equivalence of the computer-based and paper-based
measures of recognition performance was significantly
less than that of the degree of confidence. The inherent
difference  in  silhouette  viewing  times  between  the 
computer-based  and  paper-based  assessment  modes  was 
expected  to  affect  the  recognition  process  itself  more  than 
the  degree  of  confidence  in  identification  judgment. 

The  results  of  this  research  supported  the  findings  of 
some studies, but not others. Federico and Liggett (1988,
1989) administered computer-based and paper-based tests
of  semantic  knowledge  (Liggett  &  Federico,  1986)  to  de¬ 
termine  the  relative  reliability  and  validity  of  these  two 
modes  of  assessment.  Estimates  of  internal  consistencies, 
equivalences,  and  discriminant  validities  were  computed. 
They  established  that  computer-based  and  paper-based 
measures  (i.e.,  test  score  and  average  degree  of  confidence)
were  not  significantly  different  in  reliability  or  internal
consistency.  This  finding  partially  agrees  with  the
corresponding  result  of  the  present  study,  since  computer- 
based  and  paper-based  measures  of  test  score  were  found 
to  be  equally  reliable;  however,  the  computer-based  mea¬ 
sure  of  average  degree  of  confidence  was  found  to  be  less 
reliable  than  its  paper-based  counterpart.  A  few  of  the 
Federico  and  Liggett  findings  were  ambivalent,  since 
some  results  suggested  that  equivalence  estimates  for 
computer-based  and  paper-based  measures  (i.e.,  test  score 
and  average  degree  of  confidence)  were  about  the  same, 
and  another  suggested  that  these  estimates  were  differ¬ 
ent.  Some  of  their  reported  results  are  different  from  those 
established  in  the  present  study,  in  which  computer-based 
and  paper-based  measures  of  test  score  were  less  equiva¬ 
lent  than  these  measures  of  average  degree  of  confidence. 
Finally,  Federico  and  Liggett  demonstrated  that  the  dis¬ 
criminative  validity  of  the  computer-based  measures  was 
superior  to  that  of  the  paper-based  measures.  This  result 
is  in  partial  agreement  with  that  found  in  the  present 
research,  where  it  was  also  established  with  respect  to 
some  statistical  criteria.  However,  according  to  other  cri¬ 
teria,  the  discriminative  validity  of  computer-based  and 
that  of  paper-based  measures  were  about  the  same. 

The  results  of  our  present  research  supported  some 
studies,  but  not  others.  Hofer  and  Green  (1985)  were  con¬ 
cerned  that  computer-based  assessment  would  introduce 
irrelevant  or  extraneous  factors  that  would  likely  degrade 
test  performance.  These  computer-correlated  factors  may
alter  the  nature  of  the  task  to  such  a  degree  that  it  would 
be  difficult  for  a  computer-based  test  and  its  paper-based 
counterpart  to  measure  the  same  construct  or  content.  This 
could  affect  reliability,  validity,  and  normative  data,  as 
well  as  other  assessment  attributes.  Several  plausible  rea¬ 
sons,  according  to  Hofer  and  Green,  may  contribute  to 
different  performances  on  these  distinct  kinds  of  testing: 
(1)  state  of  anxiety  instigated  when  confronted  by  computer-based
testing,  (2)  lack  of  computer  familiarity  on  the  part
of  the  test  taker,  and  (3)  changes  in  response  format  re¬ 
quired  by  the  two  modes  of  assessment.  These  different 
dimensions  could  result  in  tests  that  are  nonequivalent; 
however,  in  our  present  research  these  diverse  factors  had 
no  apparent  impact. 

On  the  other  hand,  there  are  a  number  of  known  differences
between  computer-based  and  paper-based  assessment  that
may  affect  equivalence  and  validity  (Green,  1986):

(1)  Passive  omitting  of  items  is  usually  not  permitted  on
computer-based  tests.  An  individual  must  respond  in  a
different  manner  than  on  paper-based  tests.  (2)  Computerized
tests  typically  do  not  permit  backtracking.  The  test
taker  cannot  easily  review  items,  alter  responses,  or  delay
attempting  to  answer  questions.  (3)  The  capacity  of
the  computer  screen  can  have  an  impact  on  what  usually 
are  long  test  items,  for  example,  paragraph  comprehen¬ 
sion.  These  may  be  shortened  to  accommodate  the  com¬ 
puter  display,  thus  partially  changing  the  nature  of  the 
task.  (4)  The  quality  of  computer  graphics  may  affect 
the  comprehension  and  degree  of  difficulty  of  the  item. 
(5)  Pressing  a  key  or  using  a  mouse  may  be  easier  than 
marking  an  answer  sheet.  This  may  affect  the  validity  of 
speeded  tests.  (6)  Since  the  computer  typically  displays 
items  individually,  traditional  time  limits  are  no  longer 
necessary. 

Assuming  that  these  abstract  distinctions  may  affect  the 
equivalence  and  validity  of  computer-based  and  paper- 
based  assessment,  the  omission  of  items  and  backtrack¬ 
ing  on  paper-based  tests  in  this  research  were  not  per¬ 
mitted  in  order  to  simulate  computer-based  tests.  Com¬ 
puter  screen  capacity  was  of  no  consequence  in  this  study 
since  none  of  the  test  items  were  long.  The  graphics  used 
in  the  paper-based  recognition  test  were  screen  dumps  of 
the  actual  aircraft  silhouettes  used  in  its  computer-based 
counterpart.  That  is,  these  images  were  of  the  same  quality.
In  this  study,  neither  the  computer-based  nor  the
paper-based  measurement  employed  true  speeded  tests.  Also,
we  attempted  to  mimic  the  individual  display  of  items  on 
the  computer-based  tests  used  in  this  research  by  closely 
monitoring  the  subjects  as  they  took  the  paper-based  test 
and  reminding  them  to  expedite  responding  without 
retracing. 

When  evaluating  or  comparing  different  media  for  in¬ 
struction  and  assessment,  the  newer  medium  may  simply 
be  perceived  as  being  more  interesting,  engaging,  and 
challenging  by  the  students.  This  novelty  effect  seems  to 
disappear  as  rapidly  as  it  appears.  However,  in  research 
studies  conducted  over  a  relatively  short  time  span,  for 
example,  a  few  days  or  months  at  the  most,  this  effect 
may  still  linger  and  affect  the  evaluation  by  enhancing  the 
impact  of  the  more  novel  medium  (Colvin  &  Clark,  1984);
this  effect  could  have  occurred  in  our  present  research. 
When  matching  media  to  distinct  subject  matters,  course 
contents,  or  core  concepts,  some  research  evidence  (Jami¬ 
son,  Suppes,  &  Welles,  1974)  indicates  that,  other  than 
in  obvious  cases,  just  about  any  medium  will  be  effective 
for  different  content.  Extrapolating  this  notion  to  the  mea¬ 
surement  domain,  the  validity  results  of  this  study  seem 
to  suggest,  contrary  to  the  above,  that  different  media  may 
be  differentially  effective  in  testing  the  same  subject  matter. 